An exploration of train delay data

Published: 2025-05-13

Purpose of this Quarto HTML report

  • To explore the dataset from Zhang et al.
  • To determine whether weather is linked to train delays.

Dimensions

[1] 2751713      16

Column names

 [1] "date"                     "train_number"            
 [3] "train_direction"          "station_name"            
 [5] "station_order"            "scheduled_arrival_time"  
 [7] "scheduled_departure_time" "stop_time"               
 [9] "actual_arrival_time"      "actual_departure_time"   
[11] "arrival_delay"            "departure_delay"         
[13] "wind"                     "weather"                 
[15] "temperature"              "major_holiday"           

Summary statistics

station_name                Mean_arriv  Mean_depar  stdev_arriv  stdev_delay    n  unique_arriv  unique_dep  Mean_temp
Jianwei Railway Station       532.0000    532.0000      0.00000      0.00000   29             1           1  10.793103
Yuzhou Railway Station        531.4444    531.4444     75.67617     75.67617   36             3           3   5.805556
Guanyun Railway Station       500.0132    500.0132    178.21474    178.21474  151             9           9   6.622517
Fangcheng Railway Station     485.4444    485.4444     75.67617     75.67617   36             3           3   6.055556
Jieshounan Railway Station    465.2176    465.2176    359.13111    359.13111  239            18          18   5.941423
Xingandong Railway Station    416.3605    416.3605    212.42152    212.42152  147            10          10  11.476190

train_number  Mean_arriv  Mean_depar  stdev_arriv  stdev_delay   n  unique_arriv  unique_dep  Mean_temp
G4027           853.1429    696.7143     382.2505     542.9797   7             7           7   28.85714
G4919           840.0000    422.6667     653.4977     701.7814   6             3           3   23.66667
G4950           826.5000    642.6667     410.2257     578.8743   6             6           6   22.66667
G9252           811.0000    722.3077     253.9600     418.3552  13            13          13   19.92308
G4923           801.0000    531.0000     534.6631     640.0818   4             4           4   24.75000
G4966           661.2500    447.7500     411.7026     502.2102   8             4           4   22.00000
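
The code that produced these summaries is not shown in the rendered report. As a minimal sketch, assuming the summaries are plain per-group means, standard deviations, counts and distinct-value counts, the same columns could be computed in base R as below. The toy data frame and the helper `summarise_group()` are hypothetical, and `stdev_delay` is assumed here to be the standard deviation of `departure_delay`.

```r
# Toy data, made up for illustration; these are not values from the dataset
delays <- data.frame(
  station_name    = c("A", "A", "A", "B", "B"),
  arrival_delay   = c(2, 4, 4, 0, 10),
  departure_delay = c(1, 3, 3, 0, 8),
  temperature     = c(5, 7, 9, 10, 12)
)

# Hypothetical helper: build one summary row for one group of rows
summarise_group <- function(d) {
  data.frame(
    station_name = d$station_name[1],
    Mean_arriv   = mean(d$arrival_delay),
    Mean_depar   = mean(d$departure_delay),
    stdev_arriv  = sd(d$arrival_delay),
    stdev_delay  = sd(d$departure_delay),  # assumption: SD of departure delay
    n            = nrow(d),
    unique_arriv = length(unique(d$arrival_delay)),
    unique_dep   = length(unique(d$departure_delay)),
    Mean_temp    = mean(d$temperature)
  )
}

# Split by station, summarise each piece, and stack the rows
station_summary <- do.call(
  rbind,
  lapply(split(delays, delays$station_name), summarise_group)
)

# Sort by mean arrival delay, largest first
station_summary <- station_summary[order(-station_summary$Mean_arriv), ]
print(station_summary)
```

Grouping by `train_number` instead of `station_name` would give the second table. Sorting by `Mean_arriv` in descending order matches the row order of both tables above, which suggests they show only the top rows of longer summaries.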

Average departure delays

Looking at the raw data (rather than the summary statistics)

By number of departures

Basic analysis of the relationship between weather and departure delays


Call:
aov(formula = departure_delay ~ weather, data = subset3)

Residuals:
   Min     1Q Median     3Q    Max 
-6.222 -0.775 -0.537  0.415 78.415 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   -0.41481    0.04576  -9.066  < 2e-16 ***
weatherdownpour               -0.13064    1.03119  -0.127  0.89919    
weatherfog                    -0.09265    0.41992  -0.221  0.82538    
weatherhaze                    0.41481    0.85540   0.485  0.62773    
weatherheavy snow              4.18404    0.54902   7.621 2.67e-14 ***
weatherlight rain              0.44859    0.09085   4.938 7.99e-07 ***
weatherlight snow              0.24500    0.23908   1.025  0.30549    
weatherlight to moderate rain  0.06270    0.40806   0.154  0.87788    
weathermoderate rain           2.02675    0.29868   6.786 1.20e-11 ***
weathermoderate snow           0.27592    0.57129   0.483  0.62911    
weathermoderate to heavy snow  5.63704    1.13982   4.946 7.67e-07 ***
weatherovercast                0.18985    0.08799   2.158  0.03096 *  
weathershowers                -0.08519    0.55615  -0.153  0.87826    
weathersleet                   1.52976    0.26303   5.816 6.15e-09 ***
weathersnow showers            1.41481    0.54902   2.577  0.00998 ** 
weathersunny                  -0.04837    0.06704  -0.721  0.47062    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.417 on 15230 degrees of freedom
Multiple R-squared:  0.01255,   Adjusted R-squared:  0.01158 
F-statistic: 12.91 on 15 and 15230 DF,  p-value: < 2.2e-16
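
A coefficient table printed under an `aov()` call is the format that `summary.lm()` produces for an `aov` fit, so the output above was presumably obtained that way. The sketch below, on made-up data with hypothetical weather levels, shows how to read such a table, assuming R's default treatment contrasts. The intercept is the mean delay of the reference level (the first factor level, which the report does not name), and every other coefficient is that level's mean minus the reference mean. Read this way, the fitted mean for heavy snow above would be -0.41481 + 4.18404 ≈ 3.77.

```r
set.seed(1)

# Toy data; the level names and delay values are made up for illustration
toy <- data.frame(
  weather = factor(rep(c("cloudy", "light rain", "sunny"), each = 20)),
  departure_delay = c(rnorm(20, mean = 0), rnorm(20, mean = 1), rnorm(20, mean = 0))
)

fit <- aov(departure_delay ~ weather, data = toy)

print(summary(fit))     # classic one-way ANOVA table: a single F test for weather
print(summary.lm(fit))  # per-level coefficients, in the format shown above

# The intercept is the mean of the reference level (the first factor level)
ref      <- levels(toy$weather)[1]
ref_mean <- mean(toy$departure_delay[toy$weather == ref])
print(all.equal(unname(coef(fit)["(Intercept)"]), ref_mean))

# Each other coefficient is that level's mean minus the reference mean
rain_mean <- mean(toy$departure_delay[toy$weather == "light rain"])
print(all.equal(unname(coef(fit)["weatherlight rain"]), rain_mean - ref_mean))
```

Because each p-value in the coefficient table compares one level only against the reference level, the overall F test is the relevant check for whether weather matters at all.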


Call:
aov(formula = departure_delay ~ wind_strength, data = subset3)

Residuals:
   Min     1Q Median     3Q    Max 
-3.968 -0.692 -0.668  0.332 78.308 

Coefficients:
                                       Estimate Std. Error t value Pr(>|t|)  
(Intercept)                             -0.1949     0.1413  -1.379   0.1678  
wind_strengthfresh breeze from the      -0.3315     0.2169  -1.528   0.1264  
wind_strengthgentle breeze from the     -0.1131     0.1510  -0.749   0.4536  
wind_strengthlight winds                -0.3197     0.1673  -1.911   0.0560 .
wind_strengthlight winds from the       -0.1373     0.1470  -0.934   0.3504  
wind_strengthmoderate breeze from the    0.3824     0.1692   2.260   0.0238 *
wind_strengthmoderate gale from the     -0.4718     1.9867  -0.237   0.8123  
wind_strengthstrong breeze from the      1.1632     0.4549   2.557   0.0106 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.432 on 15238 degrees of freedom
Multiple R-squared:  0.002933,  Adjusted R-squared:  0.002475 
F-statistic: 6.404 on 7 and 15238 DF,  p-value: 1.514e-07


    Welch Two Sample t-test

data:  departure_delay by wind_strength2
t = -3.9443, df = 1532, p-value = 8.363e-05
alternative hypothesis: true difference in means between group light winds and group strong winds is not equal to 0
95 percent confidence interval:
 -0.8456282 -0.2839107
sample estimates:
 mean in group light winds mean in group strong winds 
                -0.3445731                  0.2201964 
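
The confidence interval above is centred on the difference in group means, -0.3445731 - 0.2201964 ≈ -0.565. How `wind_strength2` was derived from `wind_strength` is not shown in the report, so the following is only a sketch on made-up data with a hypothetical two-level grouping. It runs the same kind of test and recomputes the Welch statistic and degrees of freedom by hand.

```r
set.seed(42)

# Toy data; group sizes, means and spreads are made up for illustration
toy <- data.frame(
  wind_strength2  = factor(rep(c("light winds", "strong winds"), times = c(200, 50))),
  departure_delay = c(rnorm(200, mean = -0.3, sd = 2), rnorm(50, mean = 0.2, sd = 4))
)

# t.test() uses the Welch correction by default (var.equal = FALSE)
res <- t.test(departure_delay ~ wind_strength2, data = toy)
print(res)

# The same statistic by hand
x  <- toy$departure_delay[toy$wind_strength2 == "light winds"]
y  <- toy$departure_delay[toy$wind_strength2 == "strong winds"]
vx <- var(x) / length(x)  # squared standard error of each group mean
vy <- var(y) / length(y)

t_manual  <- (mean(x) - mean(y)) / sqrt(vx + vy)
# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

print(all.equal(unname(res$statistic), t_manual))
print(all.equal(unname(res$parameter), df_manual))
```

The Welch form does not assume equal variances in the two groups, which is why the degrees of freedom are generally not an integer and are smaller than the pooled value.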

Focus on the stations and on actual (positive) departure delays, using the filtered data.

Exclude stations with few unique departure delay values, since so little variation is more likely due to other circumstances than to weather.
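
The filtering code itself is not shown in the rendered report. A minimal sketch of this step in base R, on made-up data, could look like the following. The cutoff `min_unique` is hypothetical, since the report does not state which threshold was used.

```r
# Toy data, made up for illustration
delays <- data.frame(
  station_name    = c("A", "A", "A", "A", "B", "B", "B", "C", "C"),
  departure_delay = c(1, 3, 7, -2, 5, 5, 5, 0, 12)
)

min_unique <- 3  # hypothetical threshold; not taken from the report

# Keep only actual (positive) departure delays
late <- delays[delays$departure_delay > 0, ]

# Count the distinct delay values observed at each station
n_unique <- tapply(
  late$departure_delay,
  late$station_name,
  function(v) length(unique(v))
)

# Retain stations with at least min_unique distinct delay values
keep     <- names(n_unique)[n_unique >= min_unique]
filtered <- late[late$station_name %in% keep, ]
print(filtered)
```

In this toy example only station A survives: station B always shows the same delay and station C has a single positive delay, which mirrors stations such as Jianwei Railway Station in the summary table, where `unique_dep` is 1.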